Statements

Acknowledgement
I am sincerely thank the OpenGeoHub Foundation for their work in making open source geospatial models for everyone. The videos posted in their website on tutorials to learn remote sensing concepts and understand the workflow structures in working with geospatial data analysis. Particular mention to the tutorial “Machine Learning for Spatial Data” by Dr. Hanna Meyer which helped me to work on this project.

Furthermore, I thank the professional certification courses relevant to understanding programming, machine learning, remote sensing concepts which helped me to work on this project.

Use of generative artificial intelligence
Generative artificial intelligence (GenAI) was mainly used for creating charts and adjusting visualization parameters in R. GenAI was also used for code debugging. However, the responses provided by GenAI were critically judged before being implemented.

Executive Summary

Problem Statement
Gorse (Ulex europaeus) plant was brought to New Zealand during the European settlement as a hedge plant and spreads into farmlands which is associated with negative consequences for the quality of the grassland. Each year millions of dollars are spent for its control and field sampling are expensive for developing management strategies. It’s of high importance to map the distribution of such invasive plants using remote sensing methods.

Project Aim and Focus
The project aim and focus is on using Machine Learning models to predict the invasive species gorse (Ulex europaeus) on the Banks Peninsula of New Zealand.

Raw data used
Sentinel satellite data and plotted land cover class of a section of the Banks Peninsula of New Zealand.

Methodology
The following methodology was undertaken for this project,
- Raw data and land cover classes are visualized and merged to a single data set to build the Machine Learning models (Random Forest, XGBoost and Support Vector Machine).
- The merged data set is split into 30% training and 70% test data which is used to train and predict using the models.
- Various analysis such as confusion matrix, scoring metrics, predictive & entropy maps, and uncertainty diagnostics were performed to analyse the models performance in land cover classification and predicting the invasive plant species.

Results
Out of the three models, RF and XGBoost model performs very well and SVM model is not up to par in classifying the land cover class using the satellite images. Based on the various analysis, Random Forest with the highest accuracy of 99.33%, Model outperforms XGBoost in predicting the invasive species gorse (Ulex europaeus), using Sentinel satellite data, on the Banks Peninsula of NewZealand.

1 Introduction

This project demonstrates land cover classifications in R using machine learning algorithms such as Random Forest, XGBoost, and SVM, and compares their performance. This was adapted from the tutorial presented by Dr. Hanna Meyer in July 2018 and has been adapted to include additional machine learning models for comparison.

The focus is on land cover classification using sentinel satellite data, specifically targeting the invasive species gorse (Ulex europaeus) on the Banks Peninsula. This plant was brought to New Zealand during the European settlement as a hedge plant and spreads into farmland which is associated with negative consequences for the quality of the grassland.

Each year millions of dollars are spent for its control. To develop management strategies, it’s of high importance to map the distribution of such invasive plants. Since field sampling means huge expenses, remote sensing methods are required for a spatially explicit monitoring. In this project, Sentinel satellite imagery from 2017 were used to map the current land cover distribution on a section of the Banks Peninsula.

2 Aim and Methodology of this Project

The project aim is to predict the invasive species gorse (Ulex europaeus) on the Banks Peninsula and focus on using machine learning algorithms such as Random Forests, XGBoost and SVM to perform this project.

The following methodology was undertaken for this project,

  • Raw data and land cover classes are visualized and merged to a single data set to build the Machine Learning models (Random Forest, XGBoost and Support Vector Machine).

  • The merged data set is split into 30% training and 70% test data which is used to train and predict using the models.

  • Various analysis such as confusion matrix, scoring metrics, predictive & entropy maps, and uncertainty diagnostics were performed to analyse the models performance in land cover classification and predicting the invasive plant species.

3 Exploratory Data Analysis - EDA

This project shows the process of land cover (LC) classification using machine learning techniques using R. The focus is on classifying land cover types, particularly the invasive species gorse (Ulex europaeus), using Sentinel satellite data.

3.1 Load required libraries

First, loading the libraries and packages that are needed for this LC classification project. The selected libraries provide functions for handling raster data, reading shapefiles, performing machine learning tasks, and visualizing results.

knitr::opts_chunk$set(echo = TRUE)

# Spatial and Raster data handling packages
library(raster)       # Handling Raster data 
library(sf)           # Handling Vector data 
library(terra)        # Optimise large raster datasets 
library(mapview)      # Quick raster/vector visualization

# Data Manipulation & Reshaping
library(dplyr)        # Data wrangling 
library(tidyr)        # Data reshaping
library(reshape2)     # Data reshaping (melt/cast)
library(ROSE)         # Balancing imbalanced datasets 

# Model Building & Machine Learning packages
library(caret)        # Model building
library(randomForest) # Random Forest
library(xgboost)      # XGBoost
library(e1071)        # SVM 
library(rpart)        # Recursive partitioning and Regression Trees
library(ranger)       # Faster implementaion of RF 

# Model Evaluation and Interpretation packages
library(ModelMetrics) # Performance Metrics 
library(pROC)         # ROC and AUC curves 
library(DALEX)        # Model explainability: 
library(ingredients)  # Aligns with DALEX
library(vip)          # Visualizing feature importance 


# Visualization, Plotting & Reporting packages
library(ggplot2)      # Powerful and flexible plotting 
library(RColorBrewer) # Predefined color palettes
library(rasterVis)    # Customized raster visualization
library(kableExtra)   # Table styling
library(viridis)      # Colorblind-friendly maps
library(tinytex)      # Compile LaTeX docs

3.2 Data Loading and Pre-processing

3.2.1 Load and explore the data

To start with, loading the Sentinel data as well as a land cover class of the training sites.

# Load Sentinel data
sentinel <- stack("D:\\Study\\Machine Learning\\Geostatistics\\R-Git\\Case-Study_ML-for-Spatial-Data\\Data\\sentinel2017.grd")

# Print the Sentinel data
print(sentinel)
## class      : RasterStack 
## dimensions : 1257, 1405, 1766085, 10  (nrow, ncol, ncell, nlayers)
## resolution : 10, 10  (x, y)
## extent     : 655950, 670000, 5138290, 5150860  (xmin, xmax, ymin, ymax)
## crs        : +proj=utm +zone=59 +south +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
## names      :          B02,          B03,          B04,          B08,          B05,          B06,          B07,          B8A,         NDVI,   yellowness 
## min values :  619.0000000,  373.0000000,  194.0000000,   98.0000000,  168.3125000,  155.1875000,  139.0625000,  100.1875000,   -0.6810878,   -0.1066810 
## max values : 5385.0000000, 5137.0000000, 5289.0000000, 6984.0000000, 4209.0625000, 5333.6875000, 6705.6875000, 7297.2500000,    0.8989085,    0.5362146
# Load training sites shapefile
training <- read_sf("D:\\Study\\Machine Learning\\Geostatistics\\R-Git\\Case-Study_ML-for-Spatial-Data\\Data\\trainingSites.shp")

# Print the training data
print(training)
## Simple feature collection with 23 features and 2 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 656358 ymin: 5138445 xmax: 669091.9 ymax: 5150851
## Projected CRS: WGS 84 / UTM zone 59S
## # A tibble: 23 × 3
##       id Class                                                          geometry
##    <dbl> <chr>                                                     <POLYGON [m]>
##  1     1 Gorse     ((660956.4 5146064, 661018.8 5146061, 661033.1 5146008, 6609…
##  2     2 Gorse     ((661120.6 5146267, 661197.3 5146263, 661205.7 5146231, 6611…
##  3     3 Gorse     ((662279.2 5148528, 662388.2 5148465, 662382.2 5148421, 6622…
##  4     4 Gorse     ((663908.7 5148424, 664023.7 5148401, 664054.9 5148404, 6641…
##  5     5 Gorse     ((662357.1 5145141, 662514 5145113, 662547.6 5145081, 662636…
##  6     6 Water     ((668392.1 5144225, 669091.9 5143123, 668248.4 5143200, 6683…
##  7     7 Water     ((665171.5 5138877, 666369.7 5138445, 665401.6 5138522, 6651…
##  8     8 Water     ((656358 5147753, 656722.2 5147403, 656444.2 5147446, 656358…
##  9     9 Grassland ((658817.8 5146617, 658875.3 5146725, 658936.4 5146717, 6589…
## 10    10 Grassland ((659407.2 5140035, 659565.4 5139956, 659539 5139786, 659457…
## # ℹ 13 more rows

Overview of the data:
The Sentinel data is a stack of raster layers, The sentinel data subset for this tutorial contains the Sentinel channels 2-8 (visible and near infrared channels) as well as the NDVI and a yellowness index that was calculated as (red+green-blue)/(red+green+blue) as additional bands. Assume the yellowness index is valuable to distinguish the striking yellow color of the gorse from other vegetation.To get an idea about the band composition and spatial resolution, please check the Sentinel Wikipedia documentation here. The training sites are polygons that were digitized in QGIS on the basis of the Sentinel data using expert knowledge. The training dataset contains different land cover types, including gorse, grassland, and other vegetation types.

Visualize the data:
Visualize the Sentinel data as a true color composite in the geographical context and overlay it with the polygons. Click on the polygons to see which land cover class is assigned to a respective polygon.

# Visualize Sentinel data with training sites
viewRGB(sentinel, r=3, g=2, b=1, map.types = "Esri.WorldImagery") + 
  mapview(training)

3.2.2 Data Pre-processing

Preparing the data for machine learning by extracting raster values for the training sites and merging them with the land cover class information from the training shape file. This step is crucial for training the machine learning model. Then, splitting the data into training and test data set to ensure the evaluation of the model’s performance.

Extract raster information:
In order to train the Machine Learning model between the spectral properties and the land cover class, create a data frame that contains the spectral properties of the training sites as well as the corresponding class information. This data frame can be produced with the extract function. The resulting data frame contains the Sentinel data for each pixel overlayed by the polygons. This data frame then still needs to be merged with the information on the land cover class from the shape file. This happens via the ID of the polygons which are given in the extracted data frame by the column “ID” and in the shape file by the attribute “id”.

# Extract raster values for training sites
extr <- raster::extract(sentinel, training, df=TRUE)

# Merge extracted data with training sites
extr <- merge(extr, training, by.x="ID", by.y="id")
kable(head(extr), caption = "Merged Data") %>%
  kable_styling(full_width = F, position = "left")
Merged Data
ID B02 B03 B04 B08 B05 B06 B07 B8A NDVI yellowness Class geometry
1 767 1113 1095 2854 1425.688 2358.625 2791.750 3091.375 0.4454292 0.4843698 Gorse POLYGON ((660956.4 5146064,…
1 767 1049 1029 2667 1409.000 2320.438 2742.688 3037.125 0.4431818 0.4608084 Gorse POLYGON ((660956.4 5146064,…
1 767 956 910 2444 1367.500 2213.312 2617.562 2923.375 0.4573644 0.4173946 Gorse POLYGON ((660956.4 5146064,…
1 760 944 918 2418 1347.500 2184.625 2574.375 2874.875 0.4496403 0.4202898 Gorse POLYGON ((660956.4 5146064,…
1 743 1017 981 2543 1349.000 2234.375 2613.125 2891.625 0.4432463 0.4578621 Gorse POLYGON ((660956.4 5146064,…
1 737 991 939 2648 1352.000 2256.250 2637.562 2922.500 0.4764427 0.4473191 Gorse POLYGON ((660956.4 5146064,…

Splitting data into training and test data:
In order to keep data for a later (nearly) independent validation as well as to limit the number of data points so that the model training won’t take long time, splitting the total data set into 30% training data and 70% test data is necessary. The more testing (unknown) data the higher the model accuracy. Caret’s createDataPartition takes care that the class distribution is the same in both data sets.

# Set seed for reproducibility
set.seed(100)

# Create training and test dataset
trainids <- createDataPartition(extr$Class, list=FALSE, p=0.3)
trainDat <- extr[trainids,]
testDat <- extr[-trainids,]

# Ensure Class is a factor and is at the same level
trainDat$Class <- factor(trainDat$Class)
testDat$Class <- factor(testDat$Class)

# Check the distribution of classes in training and test datasets
training_class = round(prop.table(table(trainDat$Class)) * 100, 2)
test_class = round(prop.table(table(testDat$Class)) * 100, 2)

kable(training_class, caption = "Training class balance") %>%
  kable_styling(full_width = F, position = "left")
Training class balance
Var1 Freq
Bare 5.73
Bush 9.21
Forestry 9.27
Gorse 7.96
Grassland 6.41
Sand 2.81
Urban 2.28
Water 56.33
kable(test_class, caption = "Test class balance") %>%
  kable_styling(full_width = F, position = "left")
Test class balance
Var1 Freq
Bare 5.71
Bush 9.20
Forestry 9.27
Gorse 7.96
Grassland 6.39
Sand 2.77
Urban 2.29
Water 56.40

Class balance inferences shows that there are severe class imbalance within both training and test data sets. It is required to balance the class using up-sampling as the Water has a high frequency and the remaining classes are similar to each other and much lower that the Water.

Visualize relationships:
To get an idea about the relationships between the spectral Sentinel data and the land cover class, visualize how the different bands differs according to the land cover class. This helps us understand the distribution of different classes in the feature space.

# Visualize relationships (box plot) between yellowness index and land cover class
ggplot(trainDat, aes(x = factor(Class), y = yellowness)) +
  geom_boxplot() +
  labs(title = "Yellowness Index by Land Cover Class",
       x = "Land Cover Class",
       y = "Yellowness Index") +
  theme_minimal()

# Visualize relationships (box plot) between NDVI and land cover class
ggplot(trainDat, aes(x = factor(Class), y = NDVI)) +
  geom_boxplot() +
  labs(title = "NDVI by Land Cover Class",
       x = "Land Cover Class",
       y = "NDVI") +
  theme_minimal()

The box plots show the distribution of the yellowness index and NDVI for each land cover class, providing insights into how these indices vary across different classes. The Gorse class features the highest “yellowness” values while all other land cover classes have considerably lower values.

Feature Plot:
To get an impression about the separability of the classes by creating a feature plot that visualizes the location of the training samples in a scatter plot of two Sentinel channels.

# Create feature plot to visualize relationships between Sentinel channels (B08 vs yellowness) and land cover class
ggplot(trainDat, aes(x = B08, y = yellowness, color = factor(Class))) +
  geom_point() +
  labs(title = "Feature Plot: Sentinel Band 8 vs Yellowness",
       x = "Sentinel Band 8 (NIR)",
       y = "Sentinel Band Yellowness",
       color = "Land Cover Class") +
  theme_minimal()

The feature plot shows the distribution of training samples in the feature space defined by Sentinel Band 8 (NIR) and the yellowness index. It helps us visualize how well the different land cover classes can be separated based on these features. The Gorse class features the highest “yellowness” values while all other land cover classes have considerably lower values. Based on the feature plot, note that there is only low separability considering Sentinel channel 3 and 4 (green and red) but high separability when channel 8 (near infrared) or the yellowness index is included.

4 Machine Learning Model

The split data is now ready for training machine learning models and ready for implementation in the three different algorithms: Random Forest, XGBoost, and Support Vector Machine (SVM). Each model will be trained on the training data set and evaluated on the test data set.

Selecting predictor and response variable:
Before building the ML models, all the Sentinel bands and indices were selected as the predictor variables and the land cover class were selected as the response variables. The response variable is categorical, while the predictor variables are continuous. From the data, select all the information from the sentinel raster data as the predictor variables and the land cover class as the response variable.

# Select predictor variables (Sentinel bands and indices)
predictors <- c("B02", "B03", "B04", "B05", "B06", "B07", "B08", "NDVI", "yellowness")

# Select response variable (land cover class)
response <- 'Class'

4.1 Random Forest

Random Forest is a popular ensemble learning method that constructs multiple decision trees and merges them together to get a more accurate and stable prediction. It is particularly effective for classification tasks and suits this LC classification analysis. Training the Random Forest model using the randomForest package in R. The model is trained on the training data set, and visualize the variable importance to understand which features contribute most to the classification. train function from the caret package was used to train the model with 10-fold cross-validation for hyperparameter tuning.

# Ensure response is a factor
trainDat[, response] <- as.factor(trainDat[, response])

# Define hyperparameter tuning grid 
rf_grid <- expand.grid(
  mtry = c(2, 3, 4),  # Number of variables randomly sampled at each split
  splitrule = "gini", # Splitting rule
  min.node.size = c(1, 5) # Minimum size of terminal nodes
)

# Setting train control
control <- trainControl(
  method = "cv",         # k-fold CV
  number = 10,           # 10-fold
  summaryFunction = multiClassSummary,  # for multiclass metrics
  classProbs = TRUE,
  savePredictions = "final",
  sampling = "up"
)

# Train Random Forest model
set.seed(100)
model_rf <- train(trainDat[, predictors], trainDat[, response], 
                  method = "rf", 
                  trControl = control,
                  importance = TRUE)
# Print model summary
kable(model_rf$results, caption = "Random Forest Model Results") %>%
  kable_styling(full_width = F, position = "left")
Random Forest Model Results
mtry logLoss AUC prAUC Accuracy Kappa Mean_F1 Mean_Sensitivity Mean_Specificity Mean_Pos_Pred_Value Mean_Neg_Pred_Value Mean_Precision Mean_Recall Mean_Detection_Rate Mean_Balanced_Accuracy logLossSD AUCSD prAUCSD AccuracySD KappaSD Mean_F1SD Mean_SensitivitySD Mean_SpecificitySD Mean_Pos_Pred_ValueSD Mean_Neg_Pred_ValueSD Mean_PrecisionSD Mean_RecallSD Mean_Detection_RateSD Mean_Balanced_AccuracySD
2 0.0194079 0.9996537 0.3135054 0.9944470 0.9914640 0.9731647 0.9717758 0.9992815 0.9772389 0.9992905 0.9772389 0.9717758 0.1243059 0.9855286 0.0078280 0.0003859 0.0337306 0.0039935 0.0061337 0.0194530 0.0201904 0.0005168 0.0180031 0.0005101 0.0180031 0.0201904 0.0004992 0.0103524
5 0.0192705 0.9996919 0.2236961 0.9932740 0.9896542 0.9682776 0.9671660 0.9991251 0.9747691 0.9991392 0.9747691 0.9671660 0.1241592 0.9831455 0.0135754 0.0004931 0.0369808 0.0051617 0.0079476 0.0263027 0.0261678 0.0006728 0.0214679 0.0006581 0.0214679 0.0261678 0.0006452 0.0134152
9 0.0582227 0.9964809 0.1037647 0.9941511 0.9910037 0.9739457 0.9729646 0.9992407 0.9772993 0.9992473 0.9772993 0.9729646 0.1242689 0.9861027 0.0727093 0.0055828 0.0303429 0.0043519 0.0066949 0.0173850 0.0177504 0.0005681 0.0163332 0.0005644 0.0163332 0.0177504 0.0005440 0.0091501
# Extract variable importance
importance_df <- varImp(model_rf)$importance

# Add variable names as a column
importance_df$Variable <- rownames(importance_df)

# Compute overall importance (if multiple classes)
importance_df$Overall <- rowMeans(importance_df[, sapply(importance_df, is.numeric)])

# Round and sort in descending order
importance_df <- importance_df %>%
  mutate(Overall = round(Overall, 3)) %>%
  arrange(desc(Overall))  # High to low

# View top variables
kable(importance_df, caption = "Variable Importance for Random Forest Model") %>%
  kable_styling(full_width = F, position = "left")
Variable Importance for Random Forest Model
Bare Bush Forestry Gorse Grassland Sand Urban Water Variable Overall
NDVI 78.81922 51.74690 42.84082 67.26438 54.32107 82.43570 96.46106 27.60330 NDVI 62.687
yellowness 80.57554 44.51956 41.20206 75.49890 40.48474 53.72358 82.26013 20.75973 yellowness 54.878
B05 57.84457 41.90438 38.77014 43.10041 40.90441 99.16067 79.66702 33.45632 B05 54.351
B07 70.04270 42.26636 33.83277 42.12963 35.77620 64.99589 84.73350 29.86652 B07 50.455
B06 65.81163 41.86432 32.35504 40.79115 30.54126 64.34196 84.33848 34.24737 B06 49.286
B04 59.79071 51.15562 31.61819 54.23198 30.60026 64.23863 77.03137 21.19083 B04 48.732
B02 93.18263 19.81698 15.63798 57.58772 16.88825 67.45459 100.00000 14.20616 B02 48.097
B08 54.83551 37.96788 31.79415 32.47370 29.06082 67.23859 92.70732 31.65667 B08 47.217
B03 70.19851 36.28995 41.09424 45.42369 33.77928 65.11085 72.26431 0.00000 B03 45.520
# Plot variable importance
ggplot(importance_df, aes(x = reorder(Variable, Overall), y = Overall)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Variable Importance for RF model (High to Low)",
       x = "Variable", y = "Importance Score") +
  theme_minimal()

The Random Forest model is successfully trained and variable importance is noted for the determining which predictor variable (sentinel bands) has overall contributions to the classification. The model achieved excellent performance across multiple values of mtry, with the least logLoss results at 5. At mtry at 5, the model shows,

  • Least uncertainity (logLoss = 0.0192),

  • Nearly perfect classification (AUC score = 0.999),

  • High classification (accuracy = 0.9933),

  • Good overall class performance (balanced accuracy = 0.9831) and

  • Balanced precision and recall (mean F1 score = 0.968).

The variable importance scores for each predictor variable is derived. The variable importance plot shows which features contribute most to the classification task, with the NDVI, Yellowness & B05 being among the top 3 important predictors.

Predicting the test data with confusion matrix and scoring metrics:
To evaluate the performance of the Random Forest model on the test data set, confusion matrix and scoring metrics were calculated such as accuracy, sensitivity, and specificity. The confusion matrix provides insights into how well the model performs on the test data set.

# Predict on the test dataset
predictions_rf <- predict(model_rf,
                          newdata = testDat[, predictors])

# Extract the levels from the training data response
true_levels <- levels(trainDat[[response]])

# Convert both predicted and actual responses to factors with the same levels
predictions_rf <- factor(predictions_rf, levels = true_levels)
actual_rf <- factor(testDat[[response]], levels = true_levels)

# Construct the confusion matrix
confusion_rf <- caret::confusionMatrix(predictions_rf, actual_rf)

# Plot confusion matrix
cm_plot_rf <- ggplot(as.data.frame(confusion_rf$table), 
                     aes(x = Reference, y = Prediction)) +
  geom_tile(aes(fill = Freq), color = "white") +
  scale_fill_gradient(low = "white", high = "steelblue") +
  geom_text(aes(label = Freq), color = "black") +
  labs(title = "Confusion Matrix for Random Forest Model",
       x = "Reference Class",
       y = "Predicted Class") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

cm_plot_rf

# Print scoring metrics
kable(confusion_rf$overall, caption = "Scoring Metrics for Random Forest Model") %>%
  kable_styling(full_width = F, position = "left")
Scoring Metrics for Random Forest Model
x
Accuracy 0.9949774
Kappa 0.9922706
AccuracyLower 0.9931669
AccuracyUpper 0.9964094
AccuracyNull 0.5640382
AccuracyPValue 0.0000000
McnemarPValue NaN
# Print sensitivity and specificity for each class
kable(confusion_rf$byClass, caption = "Sensitivity and Specificity for Random Forest Model") %>%
  kable_styling(full_width = F, position = "left")
Sensitivity and Specificity for Random Forest Model
Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall F1 Prevalence Detection Rate Detection Prevalence Balanced Accuracy
Class: Bare 0.9868132 0.9985351 0.9760870 0.9992004 0.9760870 0.9868132 0.9814208 0.0571321 0.0563787 0.0577599 0.9926741
Class: Bush 1.0000000 0.9998617 0.9986376 1.0000000 0.9986376 1.0000000 0.9993183 0.0920392 0.0920392 0.0921647 0.9999309
Class: Forestry 0.9986450 1.0000000 1.0000000 0.9998616 1.0000000 0.9986450 0.9993220 0.0926670 0.0925414 0.0925414 0.9993225
Class: Gorse 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.0796082 0.0796082 0.0796082 1.0000000
Class: Grassland 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.0639126 0.0639126 0.0639126 1.0000000
Class: Sand 0.9457014 0.9987085 0.9543379 0.9984506 0.9543379 0.9457014 0.9500000 0.0277499 0.0262431 0.0274987 0.9722049
Class: Urban 0.8846154 0.9976870 0.8994413 0.9973025 0.8994413 0.8846154 0.8919668 0.0228528 0.0202160 0.0224761 0.9411512
Class: Water 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.5640382 0.5640382 0.5640382 1.0000000

Results:
Confusion Matrix and scoring metrics provide insights into the performance of the Random Forest model on the test data set. The confusion matrix shows how many instances were correctly classified for each land cover class, while the scoring metrics (accuracy, sensitivity, specificity) quantify the model’s overall performance.

Predictions on the test/unseen data, confusion matrix shows high true positives for all the classes predictions with an overall accuracy of 99.5%, but do have some irregularities in the classification. From the confusion matrix,

  • Water, Grassland, Gorse, & Bush are 100% predicted and classified correctly.

  • Out of 738 Forestry Class, 1 was predicted as Bush

  • Out of 182 Urban Class, 10 were predicted as Sand and 11 as Bare.

  • Out of 221 Sand Class, 12 were predicted as Urban.

  • Out of 455 Bare class, 6 were predicted as Urban.

This means that the Urban, Bare and Sand show very little false positives with the RF model.

4.2 XGBoost (Extreme Gradient Boosting)

XGBoost (Extreme Gradient Boosting) is a powerful and efficient implementation of gradient boosting that is widely used for classification tasks. It is known for its speed and performance, making it suitable for large data sets. Training the XGBoost model using the xgboost package in R. The model is trained on the training data set, and visualize the variable importance to understand which features contribute most to the classification. train function from the caret package was used to train the model with 10-fold cross-validation for hyperparameter tuning.

# Ensure response is a factor
trainDat[, response] <- as.factor(trainDat[, response])

# Define hyperparameter tuning grid 
xgb_grid <- expand.grid(
  nrounds = 100,
  max_depth = 6,
  eta = 0.3,
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)

# Setting train control
control <- trainControl(
  method = "cv",         # k-fold CV
  number = 10,           # 10-fold
  summaryFunction = multiClassSummary,  # for multiclass metrics
  classProbs = TRUE,
  savePredictions = "final",
  sampling = "up",
  verboseIter = FALSE
)

# Train the XGBoost model using caret
set.seed(100)
model_xgb <- train(x = trainDat[, predictors],
                   y = trainDat[, response],
                   method = "xgbTree",
                   trControl = control,
                   tuneGrid = xgb_grid)
# Print model summary
kable(model_xgb$results, caption = "XGBoost Model Results") %>%
  kable_styling(full_width = F, position = "left")
XGBoost Model Results
nrounds max_depth eta gamma colsample_bytree min_child_weight subsample logLoss AUC prAUC Accuracy Kappa Mean_F1 Mean_Sensitivity Mean_Specificity Mean_Pos_Pred_Value Mean_Neg_Pred_Value Mean_Precision Mean_Recall Mean_Detection_Rate Mean_Balanced_Accuracy logLossSD AUCSD prAUCSD AccuracySD KappaSD Mean_F1SD Mean_SensitivitySD Mean_SpecificitySD Mean_Pos_Pred_ValueSD Mean_Neg_Pred_ValueSD Mean_PrecisionSD Mean_RecallSD Mean_Detection_RateSD Mean_Balanced_AccuracySD
100 6 0.3 0 1 1 1 0.0283113 0.9997245 0.8622213 0.9929867 0.9892228 0.9708748 0.9697435 0.9990861 0.9750296 0.9990923 0.9750296 0.9697435 0.1241233 0.9844148 0.0222191 0.000436 0.0349824 0.0047978 0.0073673 0.0198403 0.0213631 0.0006256 0.015926 0.000619 0.015926 0.0213631 0.0005997 0.0109808
# Extract variable importance
importance_xgb <- xgb.importance(feature_names = predictors, 
                                 model = model_xgb$finalModel)

# Convert importance to a data frame
importance_df_xgb <- as.data.frame(importance_xgb)
importance_df_xgb <- importance_df_xgb %>%
  mutate(Variable = reorder(Feature, Gain)) %>%
  arrange(desc(Gain))

# Check the summary of the importance data frame
kable(importance_df_xgb, caption = "Variable Importance for XGBoost Model") %>%
  kable_styling(full_width = F, position = "left")
Variable Importance for XGBoost Model
Feature Gain Cover Frequency Variable
B05 0.2319615 0.1726225 0.1417647 B05
yellowness 0.1889079 0.1664336 0.1629412 yellowness
B02 0.1794137 0.1838177 0.1776471 B02
B03 0.1585555 0.1987152 0.1423529 B03
B07 0.1433434 0.1102423 0.1052941 B07
B04 0.0420511 0.1087894 0.0747059 B04
NDVI 0.0339618 0.0339363 0.0970588 NDVI
B06 0.0186276 0.0191703 0.0582353 B06
B08 0.0031775 0.0062727 0.0400000 B08
# Visualize variable importance using ggplot2
ggplot(importance_df_xgb, aes(x = Variable, y = Gain)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Variable Importance for XGBoost (High to Low)",
       x = "Variable", y = "Importance Score") +
  theme_minimal()

The XGBoost model is successfully trained and variable importance is noted for the determining which predictor variable (sentinel bands) has overall contributions to the classification. The variable importance scores for each predictor variable is derived. The variable importance plot shows which features contribute most to the classification task, with the B05, Yellowness & B02 being among the top 3 important predictors for the XGBoost model.

The XGBoost model has been trained successfully with the selected hyperparameter. The model achieved excellent performance having,

  • Least uncertainity (logLoss = 0.0283),

  • Nearly perfect classification (AUC score = 0.999),

  • Perfect agreement beyond chance (Kappa = 0.989),

  • High classification (Accuracy = 0.993),

  • Good overall class performance (balanced accuracy = 0.984), and

  • Balanced precision and recall (mean F1 score = 0.97).

These score are very similar to the Random Forest Model and will be compared in the results and discussion section.

Predicting the test data with confusion matrix and scoring metrics:
To evaluate the performance of the XGBoost model on the test data set, confusion matrix and calculate scoring metrics were calculated such as accuracy, sensitivity, and specificity. The confusion matrix provides insights into how well the model performs on the test data set.

# Predict on the test dataset
predictions_xgb <- predict(model_xgb, newdata = testDat[, predictors])

# Extract the levels from the training data response
true_levels_xgb <- levels(trainDat[[response]])

# Convert both predicted and actual responses to factors with the same levels
predictions_xgb <- factor(predictions_xgb, levels = true_levels_xgb)
actual_xgb <- factor(testDat[[response]], levels = true_levels_xgb)

# Now create the confusion matrix
confusion_xgb <- caret::confusionMatrix(predictions_xgb, actual_xgb)

# Plot confusion matrix
cm_plot_xgb <- ggplot(as.data.frame(confusion_xgb$table), aes(x = Reference, y = Prediction)) +
  geom_tile(aes(fill = Freq), color = "white") +
  scale_fill_gradient(low = "white", high = "steelblue") +
  geom_text(aes(label = Freq), color = "black") +
  labs(title = "Confusion Matrix for XGBoost Model",
       x = "Reference Class",
       y = "Predicted Class") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

cm_plot_xgb

# Print scoring metrics
kable(confusion_xgb$overall, caption = "Scoring Metrics for XGBoost Model") %>% 
  kable_styling(full_width = F, position = "left")
Scoring Metrics for XGBoost Model
x
Accuracy 0.9927172
Kappa 0.9887915
AccuracyLower 0.9905954
AccuracyUpper 0.9944654
AccuracyNull 0.5640382
AccuracyPValue 0.0000000
McnemarPValue NaN
# Print sensitivity and specificity for each class
kable(confusion_xgb$byClass, caption = "Sensitivity and Specificity for XGBoost Model") %>% 
  kable_styling(full_width = F, position = "left")
Sensitivity and Specificity for XGBoost Model
Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall F1 Prevalence Detection Rate Detection Prevalence Balanced Accuracy
Class: Bare 0.9824176 0.9985351 0.9759825 0.9989342 0.9759825 0.9824176 0.9791895 0.0571321 0.0561276 0.0575088 0.9904763
Class: Bush 0.9986357 0.9995851 0.9959184 0.9998617 0.9959184 0.9986357 0.9972752 0.0920392 0.0919136 0.0922903 0.9991104
Class: Forestry 0.9972900 1.0000000 1.0000000 0.9997233 1.0000000 0.9972900 0.9986431 0.0926670 0.0924159 0.0924159 0.9986450
Class: Gorse 0.9952681 1.0000000 1.0000000 0.9995909 1.0000000 0.9952681 0.9976285 0.0796082 0.0792315 0.0792315 0.9976341
Class: Grassland 1.0000000 0.9997317 0.9960861 1.0000000 0.9960861 1.0000000 0.9980392 0.0639126 0.0639126 0.0641637 0.9998659
Class: Sand 0.9185520 0.9980628 0.9311927 0.9976762 0.9311927 0.9185520 0.9248292 0.0277499 0.0254897 0.0273732 0.9583074
Class: Urban 0.8571429 0.9966590 0.8571429 0.9966590 0.8571429 0.8571429 0.8571429 0.0228528 0.0195881 0.0228528 0.9269009
Class: Water 1.0000000 0.9997120 0.9997774 1.0000000 0.9997774 1.0000000 0.9998887 0.5640382 0.5640382 0.5641637 0.9998560

Results
Confusion Matrix and scoring metrics provide insights into the performance of the XGBoost model on the test dataset. The confusion matrix shows how many instances were correctly classified for each land cover class, while the scoring metrics (accuracy, sensitivity, specificity) quantify the model’s overall performance.

Predictions on the test/unseen data for the XGBoost model, confusion matrix shows high true positives for all the classes predictions with an overall accuracy of 99.27%, but do have some irregularities in the classification. From the confusion matrix,

  • Water, Grassland, & Forestry are 100% predicted and classified correctly.

  • Out of 182 Urban Class, 15 were predicted as Sand and 11 as Bare

  • Out of 221 Sand Class, 18 were predicted as Urban

  • Out of 634 Gorse Class, 2 were predicted as Grassland and 1 as Urban

  • Out of 733 Bush Class, 1 was predicted as Water

  • Out of 455 Bare class, 7 were predicted as Urban and 1 as Bush.

This means that the Urban, Sand, Gorse, Bush and Bare show little false positives with the XGBoost model.

4.3 Support Vector Machine (SVM)

Support Vector Machine (SVM) is a supervised learning algorithm that can be used for classification tasks. It works by finding the hyperplane that best separates the classes in the feature space. SVM is effective in high-dimensional spaces and is suitable for this LC classification analysis. Training the SVM model using the e1071 package in R. The model is trained on the training data set, and visualize the variable importance to understand which features contribute most to the classification. train function from the caret package was used to train the model with 10-fold cross-validation for hyperparameter tuning.

# Ensure response is a factor and class weights are a list
trainDat[, response] <- as.factor(trainDat[, response])

# Define hyperparameter tuning grid 
svm_grid <- expand.grid(
  C = 2^(-1:1),         # Cost parameter
  sigma = 2^(-1:1))     # Kernel parameter

# Setting train control
control <- trainControl(
  method = "cv",         # k-fold CV
  number = 10,           # 10-fold
  summaryFunction = multiClassSummary,
  classProbs = TRUE,
  savePredictions = "final",
  sampling = "up")

# Train Support Vector Machine (SVM) model
set.seed(100)
model_svm <- train(x = trainDat[, predictors],
                   y = trainDat[, response],
                   method = "svmRadial",
                   trControl = control,
                   tuneGrid = svm_grid)
# Print model summary
kable(model_svm$results, caption = "Support Vector Machine Model Results") %>%
  kable_styling(full_width = F, position = "left")
Support Vector Machine Model Results
C sigma logLoss AUC prAUC Accuracy Kappa Mean_F1 Mean_Sensitivity Mean_Specificity Mean_Pos_Pred_Value Mean_Neg_Pred_Value Mean_Precision Mean_Recall Mean_Detection_Rate Mean_Balanced_Accuracy logLossSD AUCSD prAUCSD AccuracySD KappaSD Mean_F1SD Mean_SensitivitySD Mean_SpecificitySD Mean_Pos_Pred_ValueSD Mean_Neg_Pred_ValueSD Mean_PrecisionSD Mean_RecallSD Mean_Detection_RateSD Mean_Balanced_AccuracySD
1 0.5 0.5 1.303668 0.9287637 0.7638325 0.8514147 0.7737012 NaN 0.7151947 0.9808636 NaN 0.9813459 NaN 0.7151947 0.1064268 0.8480292 0.0169874 0.0167498 0.0204171 0.0059123 0.0091274 NA 0.0241865 0.0007683 NA 0.0007509 NA 0.0241865 0.0007390 0.0124584
4 1.0 0.5 1.294641 0.9382290 0.7742504 0.8531708 0.7763659 NaN 0.7187930 0.9810950 NaN 0.9815664 NaN 0.7187930 0.1066463 0.8499440 0.0201787 0.0117038 0.0155778 0.0057122 0.0087082 NA 0.0199254 0.0007519 NA 0.0007210 NA 0.0199254 0.0007140 0.0103050
7 2.0 0.5 1.284566 0.9387793 0.7762996 0.8546319 0.7785823 NaN 0.7244229 0.9812817 NaN 0.9817460 NaN 0.7244229 0.1068290 0.8528523 0.0197518 0.0127177 0.0158253 0.0077519 0.0117688 NA 0.0271449 0.0010165 NA 0.0009699 NA 0.0271449 0.0009690 0.0140707
2 0.5 1.0 1.267989 0.9068621 0.7192716 0.8508333 0.7728938 NaN 0.7154631 0.9808143 NaN 0.9812850 NaN 0.7154631 0.1063542 0.8481387 0.0152577 0.0280850 0.0423211 0.0058313 0.0089292 NA 0.0248521 0.0007649 NA 0.0007549 NA 0.0248521 0.0007289 0.0127806
5 1.0 1.0 1.258247 0.9096912 0.7258875 0.8531682 0.7764219 NaN 0.7224372 0.9811123 NaN 0.9815673 NaN 0.7224372 0.1066460 0.8517748 0.0226213 0.0312114 0.0474619 0.0050834 0.0077883 NA 0.0164739 0.0006658 NA 0.0006413 NA 0.0164739 0.0006354 0.0085204
8 2.0 1.0 1.248940 0.9048456 0.7143066 0.8537582 0.7773079 NaN 0.7196695 0.9811907 NaN 0.9816130 NaN 0.7196695 0.1067198 0.8504301 0.0256298 0.0313795 0.0414158 0.0088198 0.0131808 NA 0.0257316 0.0011465 NA 0.0010635 NA 0.0257316 0.0011025 0.0133953
3 0.5 2.0 1.253344 0.8683051 0.6853715 0.8479136 0.7684864 NaN 0.7074507 0.9804388 NaN 0.9809286 NaN 0.7074507 0.1059892 0.8439448 0.0203102 0.0104051 0.0206338 0.0043704 0.0065018 NA 0.0223733 0.0005722 NA 0.0005594 NA 0.0223733 0.0005463 0.0114568
6 1.0 2.0 1.246579 0.8706399 0.6880568 0.8499578 0.7715744 NaN 0.7133815 0.9807017 NaN 0.9811700 NaN 0.7133815 0.1062447 0.8470416 0.0191771 0.0124340 0.0201726 0.0050968 0.0076873 NA 0.0214185 0.0006621 NA 0.0006443 NA 0.0214185 0.0006371 0.0110071
9 2.0 2.0 1.242011 0.8712301 0.6902114 0.8505426 0.7724607 NaN 0.7155690 0.9807767 NaN 0.9812454 NaN 0.7155690 0.1063178 0.8481728 0.0138274 0.0117582 0.0214822 0.0047950 0.0072443 NA 0.0190069 0.0006270 NA 0.0006159 NA 0.0190069 0.0005994 0.0097863
# Extract variable importance
importance_svm <- varImp(model_svm)$importance

# Add variable names as a column
importance_svm$Variable <- rownames(importance_svm)

# Compute overall importance (if multiple classes)
importance_svm$Overall <- rowMeans(importance_svm[, sapply(importance_svm, is.numeric)])

# Round and sort in descending order
importance_svm <- importance_svm %>%
  mutate(Overall = round(Overall, 3)) %>%
  arrange(desc(Overall))  # High to low

# Print top variables
kable(importance_svm, caption = "Variable Importance for Support Vector Machine Model") %>%
  kable_styling(full_width = F, position = "left")
Variable Importance for Support Vector Machine Model
Bare Bush Forestry Gorse Grassland Sand Urban Water Variable Overall
B04 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000 B04 100.000
NDVI 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000 NDVI 100.000
yellowness 100.00000 99.91037 99.91037 99.91037 99.91037 100.00000 100.00000 100.00000 yellowness 99.955
B02 100.00000 99.61459 99.61459 100.00000 99.61459 99.61459 99.61459 100.00000 B02 99.759
B03 100.00000 98.08192 98.08192 99.26476 98.08192 98.08192 100.00000 100.00000 B03 98.949
B05 100.00000 80.18284 80.18284 80.18284 80.18284 100.00000 99.98892 100.00000 B05 90.090
B07 60.91243 99.97924 100.00000 60.91243 60.91243 100.00000 60.91243 99.97924 B07 80.451
B08 25.39213 99.95848 100.00000 25.39213 25.39213 100.00000 25.39213 99.95848 B08 62.686
B06 0.00000 99.23189 100.00000 0.00000 0.00000 100.00000 0.00000 99.23189 B06 49.808
# Plot variable importance
ggplot(importance_svm, aes(x = reorder(Variable, Overall), y = Overall)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Variable Importance for SVM model (High to Low)",
       x = "Variable", y = "Importance Score") +
  theme_minimal()

The SVM model is trained, summary results and variable importance are noted for the determining which predictor variable (sentinel bands) has overall contributions to this classification.The variable importance scores for each predictor variable is derived. The variable importance plot shows which features contribute most to the classification task, with the NDVI, B04 & Yellowness being among the top 3 important predictors for the XGBoost model.

Predicting the test data with confusion matrix and scoring metrics:
To evaluate the performance of the SVM model on the test data set, confusion matrix and scoring metrics such as accuracy, sensitivity, and specificity were calculted. The confusion matrix provides insights into how well the model performs on the test data set.

# Predict on the test dataset
predictions_svm <- predict(model_svm, 
                           newdata = testDat[, predictors])
probs_svm <- predict(model_svm, 
                     newdata = testDat[, predictors],
                     probability = TRUE)

# Extract the levels from the training data response
true_levels_svm <- levels(trainDat[[response]])

# Convert both predicted and actual responses to factors with the same levels
predictions_svm <- factor(predictions_svm, 
                          levels = true_levels_svm)
actual_svm <- factor(testDat[[response]], 
                     levels = true_levels_svm)

# Construct the confusion matrix
confusion_svm <- caret::confusionMatrix(predictions_svm, actual_svm)

# Plot confusion matrix
cm_plot_svm <- ggplot(as.data.frame(confusion_svm$table), aes(x = Reference, y = Prediction)) +
  geom_tile(aes(fill = Freq), color = "white") +
  scale_fill_gradient(low = "white", high = "steelblue") +
  geom_text(aes(label = Freq), color = "black") +
  labs(title = "Confusion Matrix for Support Vector Machine Model",
       x = "Reference Class",
       y = "Predicted Class") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Plot Confusion Matrix
cm_plot_svm

# Print scoring metrics
kable(confusion_svm$overall, caption = "Scoring Metrics for Support Vector Machine Model") %>%
  kable_styling(full_width = F, position = "left")
Scoring Metrics for Support Vector Machine Model
x
Accuracy 0.8528378
Kappa 0.7756107
AccuracyLower 0.8448654
AccuracyUpper 0.8605510
AccuracyNull 0.5640382
AccuracyPValue 0.0000000
McnemarPValue NaN
# Print sensitivity and specificity for each class
kable(confusion_svm$byClass, caption = "Sensitivity and Specificity for Support Vector Machine Model") %>%
  kable_styling(full_width = F, position = "left")
Sensitivity and Specificity for Support Vector Machine Model
Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall F1 Prevalence Detection Rate Detection Prevalence Balanced Accuracy
Class: Bare 1.0000000 0.9957384 0.9342916 1.0000000 0.9342916 1.0000000 0.9660297 0.0571321 0.0571321 0.0611502 0.9978692
Class: Bush 1.0000000 0.9994468 0.9945726 1.0000000 0.9945726 1.0000000 0.9972789 0.0920392 0.0920392 0.0925414 0.9997234
Class: Forestry 0.9945799 1.0000000 1.0000000 0.9994467 1.0000000 0.9945799 0.9972826 0.0926670 0.0921647 0.0921647 0.9972900
Class: Gorse 0.0662461 0.9998636 0.9767442 0.9252620 0.9767442 0.0662461 0.1240768 0.0796082 0.0052737 0.0053993 0.5330548
Class: Grassland 0.0000000 1.0000000 NaN 0.9360874 NA 0.0000000 NA 0.0639126 0.0000000 0.0000000 0.5000000
Class: Sand 0.9683258 0.8545783 0.1597015 0.9989432 0.1597015 0.9683258 0.2741832 0.0277499 0.0268709 0.1682572 0.9114521
Class: Urban 0.6703297 0.9988435 0.9312977 0.9923401 0.9312977 0.6703297 0.7795527 0.0228528 0.0153189 0.0164490 0.8345866
Class: Water 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.5640382 0.5640382 0.5640382 1.0000000

Results
Confusion Matrix and scoring metrics provide insights into the performance of the Support Vector Machine (SVM) model on the test data set. The confusion matrix shows how many instances were highly misclassified for each land cover class, while the scoring metrics (accuracy, sensitivity, specificity) shows the model’s poor performance.

The SVM model has not been trained successfully because the model achieved poor performance having very low scores for the different scoring matrix. The confusion matrix shows that the model has not classified the defined land cover classes as there are lot of misclassification. Hence the model does not fit this land cover classification and will not be further used to compare the models and assess the results and discussions of this project.

5 Results & Discussions {#results-&-discussions}

5.1 Predicted maps

Now the predicted model maps are visualized to check the land cover classification and lay the training classes on the predicted models to verify the classification visually.

# Stacking Sentinel bands + indices
predictors_stack <- stack(sentinel$B02, 
                          sentinel$B03, 
                          sentinel$B04, 
                          sentinel$B05,
                          sentinel$B06, 
                          sentinel$B07, 
                          sentinel$B08,
                          sentinel$NDVI, 
                          sentinel$yellowness)

# Convert raster stack to dataframe (pixel wise)
pixels_df <- as.data.frame(predictors_stack, na.rm = TRUE)

# Predict for raster cells for each model
predictions_rf <- predict(model_rf, newdata = pixels_df, type = "raw")
predictions_xbg <- predict(model_xgb, newdata = pixels_df, type = "raw")
# Create raster layers for each model prediction
raster_rf  <- setValues(raster(predictors_stack), predictions_rf)
raster_xgb <- setValues(raster(predictors_stack), predictions_xbg)

# Define class names and palette
classes <- c("Bare", "Bush", "Forestry", "Gorse", "Grassland", "Sand", "Urban", "Water")# Replace with your actual column name
class_colors <- c(
  "Bare"       = "#a6643a",   # brown
  "Bush"       = "#d01c8b",   # magenta
  "Forestry"   = "#1b7837",   # dark green
  "Gorse"      = "#ffff00",   # yellow
  "Grassland"  = "#4dac26",   # light green
  "Sand"       = "#f6e8c3",   # sand yellow
  "Urban"      = "#878787",   # gray
  "Water"      = "#4393c3"    # blue
)
palette <- unname(class_colors[classes])

# Convert to SpatRaster (terra) and assign class levels
raster_rf  <- rast(raster_rf);  levels(raster_rf)  <- data.frame(value = 1:8, class = classes)

raster_xgb <- rast(raster_xgb); levels(raster_xgb) <- data.frame(value = 1:8, class = classes)

mapview(raster_rf, col.regions = palette, na.color = NA) + 
  mapview(training, zcol = "Class", col.regions = "black", alpha = 0.5)
mapview(raster_xgb, col.regions = palette, na.color = NA) + 
  mapview(training, zcol = "Class", col.regions = "black", alpha = 0.5)

Discussions:
The predicted thematic maps are overlayed with the training land cover classes to visually validated the model performance. Based on the maps, although there are slight misclassifications, both the models shows utmost accurate classification as indicated by the confusion matrix and scoring metrics. Out of the two, Random Forest shows better performance in land cover classification.

5.2 Model Performance & Uncertainly Analysis

Uncertainty analysis aims to quantify the reliability of predicted land cover maps derived from remote sensing data. It involves assessing and managing errors and uncertainties associated with the classification process, including those arising from data acquisition, processing, and the classification algorithms themselves. 

1. Confusion matrices & Scoring metrics

Comparing the Confusion matrices of the models to verify each model performed in classifying the land cover types. The confusion matrices show the number of true positives, false positives, true negatives, and false negatives for each class. This allows to assess the strengths and weaknesses of each model in terms of classification accuracy.

# Create a list of confusion matrices for each model
cm_plot_rf

cm_plot_xgb

Compare the scoring metrics (precision, f1score, recall, accuracy, sensitivity, specificity) of the RF & XGBoost models to see which model performed best overall. The accuracy metric indicates the overall correctness of the model, while sensitivity and specificity provide insights into how well the model identifies each class.

# Create a data frame to hold the scoring metrics for each model from the test dataset
scoring_df <- data.frame(
  Metric = c('Accuracy', 'Kappa', 'Precision', 'Sensitivity', 'Specificity','F1'),
  RandomForest = c(confusion_rf$overall['Accuracy'],
                   confusion_rf$overall['Kappa'],
                   mean(confusion_rf$byClass[,"Precision"]),
                   mean(confusion_rf$byClass[,"Sensitivity"]),
                   mean(confusion_rf$byClass[,"Specificity"]),
                   mean(confusion_rf$byClass[,"F1"])),
  XGBoost = c(confusion_xgb$overall['Accuracy'],
              confusion_xgb$overall['Kappa'],
              mean(confusion_xgb$byClass[,"Precision"]),
              mean(confusion_xgb$byClass[,"Sensitivity"]),
              mean(confusion_xgb$byClass[,"Specificity"]),
              mean(confusion_rf$byClass[,"F1"])))

# Print the scoring metrics data frame
kable(scoring_df, caption = "Scoring Metrics for Each Model") %>%
  kable_styling(full_width = F, position = "left")
Scoring Metrics for Each Model
Metric RandomForest XGBoost
Accuracy 0.9949774 0.9927172
Kappa 0.9922706 0.9887915
Precision 0.9785630 0.9695125
Sensitivity 0.9769719 0.9686633
Specificity 0.9993490 0.9990357
F1 0.9777535 0.9777535
# Convert to long format for plotting
scoring_df_long <- melt(scoring_df, id.vars = "Metric",
                        variable.name = "Model", value.name = "Score")
# Plot the scoring metrics for each model
ggplot(scoring_df_long, aes(x = Metric, y = Score, fill = Model)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Scoring Metrics Comparison between the models",
       x = "Metrix", y = "Score") +
  scale_fill_brewer(palette = "Set1") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Discussion - Model Performance:
Confusion matrix from Random Forest and XGBoost are shown here. As per the results from the confusion matrix, the Random Forest Model scores high true positives and very low false positives, when compared with the XGBoost model. Furthermore, the scoring metrics provide a comprehensive overview of the performance of each model. The Random Forest model achieved the highest accuracy compared to the XGBoost. The precision and sensitivity scores indicate that Random Forest performed well in identifying the gorse class than the XGBoost. The F1 score, which balances precision and recall, also shows that Random Forest outperformed XGBoost.

2. Probability-based Entropy Maps (for uncertainty)
For this uncertainity analysis, first predicting the class probabilities for the models is required. With the predicted class probabilities, entropy per pixel is calculated and then map these entropy values as a raster data.

# Preparing the data
# Convert the raster predictors to dataframe
pred_data <- as.data.frame(predictors_stack, na.rm = TRUE)
valid_idx <- which(complete.cases(pred_data))
pred_data_valid <- pred_data[valid_idx, ]

# Predicting the model probabilities
prob_rf <- predict(model_rf, pred_data, type='prob')
prob_xgb <- predict(model_xgb, pred_data, type='prob')

# Computing the entropy for the models
compute_entropy <- function (p) {
  epsilon <- 1e-12 # Avoid log 0
  p <- pmax(pmin(p, 1 - epsilon), epsilon)  # Clip values between epsilon and 1 - epsilon
  -rowSums(p * log2(p))
}

# Apply the model probabilitites to the entropy function
entropy_rf <- compute_entropy(as.matrix(prob_rf))
entropy_xgb <- compute_entropy(as.matrix(prob_xgb))

# Setting up the plot
# Create templates for the model
entropy_raster_rf <- raster::raster(predictors_stack)
entropy_raster_xgb <- raster::raster(predictors_stack)

# Setting all values to NA
values(entropy_raster_rf) <- NA
values(entropy_raster_xgb) <- NA

# Inserting the entropy values at valid locations
values(entropy_raster_rf)[valid_idx] <- entropy_rf
values(entropy_raster_xgb)[valid_idx] <- entropy_xgb

# Convert raster to data frame for ggplot
# Random Forest
df_entropy_rf <- as.data.frame(entropy_raster_rf, 
                               xy = TRUE, na.rm = TRUE)
colnames(df_entropy_rf)[3] <- "entropy"
# XGBoost
df_entropy_xgb <- as.data.frame(entropy_raster_xgb, 
                                xy = TRUE, na.rm = TRUE)
colnames(df_entropy_xgb)[3] <- "entropy"

# Create the ggplot for Random Forest entropy
p_rf <- ggplot(df_entropy_rf, aes(x = x, y = y, fill = entropy)) +
  geom_raster() +
  scale_fill_viridis(option = "viridis") +
  coord_equal() +
  labs(title = "Entropy - Random Forest", fill = "Entropy") +
  theme_minimal()

# Create the ggplot for XGBoost entropy
p_xgb <- ggplot(df_entropy_xgb, aes(x = x, y = y, fill = entropy)) +
  geom_raster() +
  scale_fill_viridis(option = "viridis") +
  coord_equal() +
  labs(title = "Entropy - XGBoost", fill = "Entropy") +
  theme_minimal()         

# Plot the entropy maps
par(mfrow=c(1,2))
p_rf

p_xgb

par(mfrow=c(1,1))                               

Discussion
These entropy maps shows the uncertainity in the predictions in the model. More yellow the higher entropy and more purple the lower entropy. Low entropy signifies the confidence in predictions where high entropy shows the uncertaininty and might be similar probabilities to multiple class. In these maps, RF entropy maps shows more gradual transitions and less extreme entropy which means that the model could be robust to noise and less flexible. On the contrary, XGBoost shows sharper contrasts and more high entropy areas particularty at the class boundaries. This could mean the model classification is sensitive to the class boundaries and also could be the problem of overfitting.

3. DALEX Uncertainity Analysis
To gain insights into the uncertainty of the model predictions, the DALEX (Descriptive mAchine Learning EXplanations) package was designed to help understand and explain complex machine learning models. DALEX allows to explore the model diagnostics for each model. This allows to analyze the model’s predictions and understand how different features contribute to the uncertainty in the predictions.

# DALEX - Prediction Diagnostics
# Define a safe predict function
predict_class <- function(m, d) predict(m, d, type = "raw")

# Create explainers for each model using DALEX
explainer_rf <- DALEX::explain(model_rf, 
                               data = testDat[, predictors], 
                               y = testDat$Class, 
                               label = "Random Forest")
## Preparation of a new explainer is initiated
##   -> model label       :  Random Forest 
##   -> data              :  7964  rows  9  cols 
##   -> target variable   :  7964  values 
##   -> predict function  :  yhat.train  will be used (  default  )
##   -> predicted values  :  No value for predict function target column. (  default  )
##   -> model_info        :  package caret , ver. 7.0.1 , task multiclass (  default  ) 
##   -> predicted values  :  predict function returns multiple columns:  8  (  default  ) 
##   -> residual function :  difference between 1 and probability of true class (  default  )
##   -> residuals         :  numerical, min =  0 , mean =  0.01093245 , max =  0.996  
##   A new explainer has been created!
explainer_xgb <- DALEX::explain(model_xgb, 
                                data = testDat[, predictors], 
                                y = testDat$Class, 
                                label = "XGBoost")
## Preparation of a new explainer is initiated
##   -> model label       :  XGBoost 
##   -> data              :  7964  rows  9  cols 
##   -> target variable   :  7964  values 
##   -> predict function  :  yhat.train  will be used (  default  )
##   -> predicted values  :  No value for predict function target column. (  default  )
##   -> model_info        :  package caret , ver. 7.0.1 , task multiclass (  default  ) 
##   -> predicted values  :  predict function returns multiple columns:  8  (  default  ) 
##   -> residual function :  difference between 1 and probability of true class (  default  )
##   -> residuals         :  numerical, min =  1.502037e-05 , mean =  0.007651202 , max =  0.9999674  
##   A new explainer has been created!
# Take 5 random observations from the test set
set.seed(123)
new_obs <- testDat[1, predictors]
# Compute diagnostics for each model
diag_rf <- predict_diagnostics(explainer_rf, 
                               new_observation =  new_obs, N=NULL)
diag_xgb <- predict_diagnostics(explainer_xgb, 
                                new_observation = new_obs, N=NULL)

# Plotting uncertainty histograms
hist_rf <- rbind(diag_rf$histogram_neighbors, diag_rf$histogram_all)
hist_rf$model <- "Random Forest"

hist_xgb <- rbind(diag_xgb$histogram_neighbors, diag_xgb$histogram_all)
hist_xgb$model <- "XGBoost"

# Combine all
combined_hist <- rbind(hist_rf, hist_xgb)

# Plotting the histogram 
ggplot(combined_hist, aes(x = Var1, y = Freq, fill = direction)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~model, scales = "free_y") +
  scale_fill_manual(values = c("neighbors" = "steelblue", "all" = "tomato")) +
  theme_minimal() +
  labs(title = "Similarity to Training Data",
       x = "Distance Bin",
       y = "Frequency",
       fill = "Compared To") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, , size = 5))

Discussion:
Both the models show a sharp spike at very low bin distance bins which means that the test dataset is highly similar to the training dataset, particularly in the XGBoost. This suggests that both models use very similar training data to make predictions. In the RF model, the model red bar are all negative, which shows that the test data as a whole is less similar to the entire training data distribution. This suggests that the RF model has some overfitting and learned well from local structures. From both the plots, RF has more spread across the bin whereas the XGBoost has a sharp spike at the lowest bin. This shows that the XGBoost heavily relies on few highly similar instances, whereas RF has spreads its attention.

6 Conclusion

The project aim is to predict the invasive species gorse (Ulex europaeus) on the Banks Peninsula in New Zealand. Based this project goal, using the Sentinel satellite images and plotted land cover class data shapes of the study area, three machine learning models (Random Forest (RF), XGBoost & Support Vector Machine (SVM) are selected and trained with similar training control methods (Cross validation method and k-fold at 10). Out of the three models, RF and XGBoost model performs very well and SVM model is not upto par in classifying the land cover class using the satellite images.

Based on various analysis such as confusion matrix, scoring metrics, predictive & entropy maps, and uncertainy diagonostics, RF model has low uncertainty and predicts the land cover classes with the highest accuracy of 99.33%, when compared to the XGBoost. In conclusion, Random Forest Model outperforms XGBoost in predicting the invasive species gorse (Ulex europaeus), using Sentinel satellite data, on the Banks Peninsula of New Zealand.